Global tree network for computing structures

ABSTRACT

A system and method for enabling high-speed, low-latency global tree communications among processing nodes interconnected according to a tree network structure. The global tree network optimally enables collective reduction operations to be performed during parallel algorithm operations executing in a computer structure having a plurality of the interconnected processing nodes. Router devices are included that interconnect the nodes of the tree via links to facilitate performance of low-latency global processing operations at nodes of the virtual tree and sub-tree structures. The global operations include one or more of: global broadcast operations downstream from a root node to leaf nodes of a virtual tree, global reduction operations upstream from leaf nodes to the root node in the virtual tree, and point-to-point message passing from any node to the root node in the virtual tree. One node of the virtual tree network is coupled to and functions as an I/O node for providing I/O functionality with an external system for each node of the virtual tree. The global tree network may be configured to provide global barrier and interrupt functionality in asynchronous or synchronized manner. Thus, parallel algorithm processing operations, for example, employed in parallel computing systems, may be optimally performed in accordance with certain operating phases of the parallel algorithm operations. When implemented in a massively-parallel supercomputing structure, the global tree network is physically and logically partitionable according to needs of a processing algorithm.

CROSS-REFERENCE TO RELATED APPLICATION

[0001] The present invention claims the benefit of commonly-owned, co-pending U.S. Provisional Patent Application Serial No. 60/271,124 filed Feb. 24, 2001 entitled MASSIVELY PARALLEL SUPERCOMPUTER, the whole contents and disclosure of which is expressly incorporated by reference herein as if fully set forth herein. This patent application is additionally related to the following commonly-owned, co-pending United States Patent Applications filed on even date herewith, the entire contents and disclosure of each of which is expressly incorporated by reference herein as if fully set forth herein. U.S. patent application Ser. No. (YOR920,020,027US1, YOR920,020,044US1 (15270)), for “Class Networking Routing”; U.S. patent application Ser. No. (YOR920,020,028US1 (15271)), for “A Global Tree Network for Computing Structures”; U.S. patent application Ser. No. (YOR920,020,029US1 (15272)), for “Global Interrupt and Barrier Networks”; U.S. patent application Ser. No. (YOR920,020,030US1 (15273)), for “Optimized Scalable Network Switch”; U.S. patent application Ser. Nos. (YOR920,020,031US1, YOR920,020,032US1 (15258)), for “Arithmetic Functions in Torus and Tree Networks”; U.S. patent application Ser. Nos. (YOR920,020,033US1, YOR920,020,034US1 (15259)), for “Data Capture Technique for High Speed Signaling”; U.S. patent application Ser. No. (YOR920,020,035US1 (15260)), for “Managing Coherence Via Put/Get Windows”; U.S. patent application Ser. Nos. (YOR920,020,036US1, YOR920,020,037US1 (15261)), for “Low Latency Memory Access And Synchronization”; U.S. patent application Ser. No. (YOR920,020,038US1 (15276)), for “Twin-Tailed Fail-Over for Fileservers Maintaining Full Performance in the Presence of Failure”; U.S. patent application Ser. No. (YOR920,020,039US1 (15277)), for “Fault Isolation Through No-Overhead Link Level Checksums”; U.S. patent application Ser. No. (YOR920,020,040US1 (15278)), for “Ethernet Addressing Via Physical Location for Massively Parallel Systems”; U.S. patent application Ser. No. (YOR920,020,041US1 (15274)), for “Fault Tolerance in a Supercomputer Through Dynamic Repartitioning”; U.S. patent application Ser. No. (YOR920,020,042US1 (15279)), for “Checkpointing Filesystem”; U.S. patent application Ser. No. (YOR920,020,043US1 (15262)), for “Efficient Implementation of Multidimensional Fast Fourier Transform on a Distributed-Memory Parallel Multi-Node Computer”; U.S. patent application Ser. No. (YOR9-20010211 US2 (15275)), for “A Novel Massively Parallel Supercomputer”; and U.S. patent application Ser. No. (YOR920,020,045US1 (15263)), for “Smart Fan Modules and System”.

BACKGROUND OF THE INVENTION

[0002] 1. Field of the Invention

[0003] This invention relates generally to the field of distributed-memory message-passing parallel computer design and system software, and more particularly, to a novel method and apparatus for interconnecting individual processors for use in a massively-parallel, distributed-memory computer, for example.

[0004] 2. Discussion of the Prior Art

[0005] Massively parallel computing structures (also referred to as “ultra-scale computers” or “supercomputers”) interconnect large numbers of compute nodes, generally, in the form of very regular structures, such as grids, lattices or tori.

[0006] One problem commonly faced on such massively parallel systems is the efficient computation of a collective arithmetic or logical operation involving many nodes. A second problem commonly faced on such systems is the efficient sharing of a limited number of external I/O connections by all of the nodes. One example of a common computation involving collective arithmetic operations over many compute nodes is iterative sparse linear equation solving techniques that require a global inner product based on a global summation.

[0007] While the three-dimensional torus interconnect computing structure 10 shown in FIG. 1, comprising a simple 3-dimensional nearest neighbor interconnect which is “wrapped” at the edges, works well for most types of inter-processor communication, it does not perform as well for collective operations such as reductions, where a single result is computed from operands provided by each of the compute nodes 12, or for efficient sharing of limited resources such as external I/O connections (not shown).

[0008] It would thus be highly desirable to provide an ultra-scale supercomputing architecture that comprises a unique interconnection of processing nodes optimized for efficiently and reliably performing many classes of operations including those requiring global arithmetic operations such as global reduction computations, data distribution, synchronization, and limited resource sharing.

[0009] The normal connectivity of high-speed networks such as the torus is simply not fully suited for this purpose because of longer latencies.

[0010] That is, mere mapping of a tree communication pattern onto the physical torus interconnect results in a tree of greater depth than is necessary if adjacent tree nodes are required to be adjacent on the torus, or a tree with longer latency between nodes when those nodes are not adjacent in the torus. In order to compute collective operations most efficiently when interconnect resources are limited, a true tree network is required, i.e., a network where the physical interconnections between nodes form the nodes into a tree.

SUMMARY OF THE INVENTION

[0011] It is an object of the present invention to provide a system and method for interconnecting individual processing nodes of a computing structure so that they can efficiently and reliably compute global reductions, distribute data, synchronize, and share limited resources.

[0012] It is another object of the present invention to provide an independent single physical network interconnecting individual processors of a massively-parallel, distributed-memory computer that is arranged as a tree interconnect and facilitates global, arithmetic and collective operations.

[0013] It is still another object of the present invention to provide an independent single physical network interconnecting individual processors of a massively-parallel, distributed-memory computer that is arranged as a global tree interconnect for providing external (input/output) I/O and service functionality to one or more nodes of a virtual tree network which is a sub-tree of the physical network. Such a global tree interconnect system may include dedicated I/O nodes for keeping message traffic off of a message-passing torus or grid computing structure.

[0014] According to the invention, there is provided a system and method for enabling high-speed, low-latency global communications among processing nodes interconnected according to a tree network structure. The global tree network optimally enables collective reduction operations to be performed during parallel algorithm operations executing in a computer structure having a plurality of the interconnected processing nodes. Router devices are included that interconnect the nodes of the tree via links to facilitate performance of low-latency global processing operations at nodes of the tree. Configuration options are included that allow for the definition of “virtual trees” which constitute subsets of the total nodes in the tree network. The global operations include one or more of: global broadcast operations downstream from a root node to leaf nodes of a virtual tree, global reduction operations upstream from leaf nodes to the root node in the virtual tree, and point-to-point message passing from any node to the root node in the virtual tree. One node of the virtual tree network is coupled to and functions as an I/O node for providing I/O functionality with an external system for each node of the virtual tree. The global tree network may be configured to provide global barrier and interrupt functionality in an asynchronous or synchronized manner. This is discussed in co-pending U.S. patent application Ser. No. (YOR920,020,029US1 (15272)). Thus, parallel algorithm processing operations, for example, employed in parallel computing systems, may be optimally performed in accordance with certain operating phases of the parallel algorithm operations. When implemented in a massively-parallel supercomputing structure, the global tree network is physically and logically partitionable according to the needs of a processing algorithm.

[0015] In a massively parallel computer, all of the compute nodes generally require access to external resources such as a filesystem. The problem of efficiently sharing a limited number of external I/O connections arises because the cost of providing such a connection is significantly higher than the cost of an individual compute node. Therefore, efficient sharing of the I/O connections ensures that I/O bandwidth does not become a limiting cost factor for system scalability. Assuming limited inter-processor interconnect, the most efficient network for sharing a single resource, in terms of average latency, is the global tree, where the shared resource is at the root of the tree.

[0016] For global and collective operations, a single, large tree may be used to interconnect all processors. However, filesystem I/O requires many, small trees with I/O facilities at the root. Because a large tree comprises multiple, smaller subtrees, the single, large tree may be used for filesystem I/O by strategically placing external connections within it at the roots of appropriately-sized subtrees. Additionally, filesystem I/O requires point-to-point messaging which is enabled by the present invention and is not required for collective operations.

[0017] Advantageously, a scalable, massively parallel supercomputer incorporating the global tree network of the invention is well-suited for parallel algorithms performed in the field of life sciences.

BRIEF DESCRIPTION OF THE DRAWINGS

[0018] Further features, aspects and advantages of the apparatus and methods of the present invention will become better understood with regard to the following description, appended claims, and the accompanying drawings where:

[0019] FIG. 1 depicts a three-dimensional torus network interconnecting eight computing nodes;

[0020] FIG. 2 depicts an example of a typical system including thirty-five (35) nodes (represented by circles), and a tree network 100 connecting all the nodes; and

[0021] FIG. 3 illustrates the basic architecture of a router device implemented in the global tree network of FIG. 2.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

[0022] The present invention may be implemented in a computer structure such as described in herein-incorporated, commonly-owned, co-pending U.S. patent application Ser. No. ______ [YOR920,010,211US2, D#15275] which describes a novel Massively Parallel Supercomputer architecture in the form of a three-dimensional torus designed to deliver processing power on the order of teraOPS (trillion floating-point operations per second) for a wide range of applications. The Massively Parallel supercomputer architecture, in the exemplary embodiment described, comprises 65,536 processing nodes organized as a 64×32×32 three-dimensional torus with each processing node connected to six (6) neighboring nodes 12. FIG. 1 shows such a torus consisting of eight (8) nodes 12, and it is easy to see how this interconnect scales by increasing the number of nodes 12 along all three dimensions. With current technology, this architecture can be leveraged to hundreds of teraOPS for applications that require significantly more computation than communication or which require only nearest neighbor communications. It should be understood that the present invention may be implemented in many other computer structures besides a supercomputer.

[0023] As mentioned, the interconnect network connecting the torus processing nodes works well for most types of inter-processor communication but not for collective operations such as reductions, where a single result is computed from operands provided by each of the nodes.

[0024] As described in herein-incorporated, commonly-owned, co-pending U.S. patent application Ser. No. ______ [D#15275], the most efficient mechanism for performing a collective reduction operation on the torus, in terms of minimum latency, is to provide a true tree network, i.e., a network where the physical interconnections between nodes form the nodes into a tree.

[0025] Thus, according to a preferred embodiment of the invention, a global tree network is provided that comprises a plurality of interconnected router devices, one per node ASIC. Each router provides three “child” ports and one “parent” port, each of which is selectively enabled. Two child ports are sufficient to create a tree topology. More children reduce the height of the tree, i.e., the number of connections required to reach the root. Thus, more children can reduce the latency for collective operations at the expense of more interconnections. The tree is formed by starting with a “root” node that has no parent (i.e., nothing connected to its parent port). The root node forms the topmost “level” of the tree. The next level down is formed by connecting one or more of the root's child ports to parent ports of other routers. In this case, the root node is the “parent” of the nodes in the level below it. This process continues recursively until nodes are reached that have no children (i.e., nothing connected to any of their router's child ports). These nodes are referred to as the “leaves” of the tree. For example, as shown in the example tree network 100 of FIG. 2, node B 110 is the root node, and the leaves are the nodes 120 at the bottom farthest from the root node. As referred to herein, data moving up the tree, toward the root, is referred to as “uptree” traffic while data traveling away from the root, toward the leaves, is referred to as “downtree” traffic.
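The trade-off between the number of child ports and the height of the tree can be illustrated with a short calculation. The following is a minimal sketch that assumes a complete, balanced tree; the function name `min_tree_height` is hypothetical and is used only for illustration.

```python
def min_tree_height(num_nodes: int, children_per_node: int) -> int:
    """Smallest possible number of links between the root and the deepest
    leaf when every router has at most children_per_node children.
    Assumes the tree is filled level by level (balanced)."""
    if num_nodes < 1 or children_per_node < 1:
        raise ValueError("need at least one node and one child port")
    height = 0
    capacity = 1      # nodes that fit in a tree of the current height
    level_width = 1   # nodes on the deepest level so far
    while capacity < num_nodes:
        level_width *= children_per_node
        capacity += level_width
        height += 1
    return height


if __name__ == "__main__":
    for ports in (2, 3):
        print(ports, "child ports:", min_tree_height(65536, ports), "levels below the root")
```

Under these assumptions, a 65,536-node machine needs at least 16 levels below the root with two child ports per router, and at least 10 levels with three child ports.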

[0026] As will be described in greater detail, the tree network may include a number of independent “virtual networks”, supported by virtual channels on the links interconnecting the routers (nodes). In order to share the links, virtual network data streams are packetized and interleaved in a fair manner. Each of the virtual networks has its own storage resources, and a functional deadlock in one will not affect the other.

[0027] Each virtual network may be further subdivided into virtual trees (or sub-trees), which may or may not be independent (within each virtual network). Any node may be configured to be the root of one of sixteen virtual trees. A virtual tree comprises the node designated as the root and all of its children, except a) nodes that are also designated as roots of the same virtual tree number, and b) children of nodes satisfying a). Therefore, the virtual trees with the same virtual tree number cannot overlap, but virtual trees with different numbers can.
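The membership rule for a virtual tree can be sketched as a pruned traversal of the physical tree. This is an illustrative sketch only: it reads “children” in the rule above as all descendants, and the node names, the dictionary representation and the function `virtual_tree_members` are assumptions made for the example.

```python
def virtual_tree_members(children, roots, root):
    """Return the nodes belonging to the virtual tree rooted at `root`.

    children: dict mapping a node to the list of its physical child nodes
    roots:    set of nodes configured as roots of this virtual tree number
    root:     the root whose virtual tree is being computed
    """
    members = []
    stack = [root]
    while stack:
        node = stack.pop()
        members.append(node)
        for child in children.get(node, []):
            # Rule a): a child that is itself a root of the same virtual
            # tree number is excluded. Rule b): so is everything below it,
            # which follows from not descending into it.
            if child in roots:
                continue
            stack.append(child)
    return sorted(members)


if __name__ == "__main__":
    physical = {"B": ["A", "C", "D"], "A": ["a1", "a2"], "C": ["c1"]}
    same_number_roots = {"A", "B"}
    print(virtual_tree_members(physical, same_number_roots, "B"))
    print(virtual_tree_members(physical, same_number_roots, "A"))
```

In this example the two virtual trees with the same number share no nodes, which matches the statement that such trees cannot overlap.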

[0028] Nodes may be configured to participate in any number of virtual trees, or none. If they participate, then they are expected to follow all tree semantics, such as contributing operands to reduction operations. As nodes may participate in multiple virtual trees, they must specify a virtual tree number for every packet they inject into a virtual network.

[0029] An example tree structure 100 used in accordance with the invention is shown in FIG. 2. More particularly, FIG. 2 depicts an example of a virtual tree network including thirty-five (35) nodes (represented by circles), and a tree network 100 connecting all the 35 nodes. The tree network 100 is used for global reductions and broadcast as will be described in greater detail. For the purpose of input/output (I/O) and resource sharing with external systems, the nodes of the example virtual network 100 of FIG. 2 are grouped into five (5) non-overlapping virtual sub-trees referenced in FIG. 2 as virtual trees 1-5. That is, each of the virtual sub-trees is indicated by a different number within the circles. Each of the respective nodes 111, 112, 113, 114 and node 110 at the root of each respective sub-tree 1-5 includes an interface connection to the external system (e.g. host or file system). Therefore, each I/O connection handles all the traffic for the seven (7) nodes of the sub-tree whose root it is connected to. In the preferred embodiment, the node at the root of each sub-tree is dedicated to I/O; however, this is not always required.

[0030] Referring to FIG. 2 and virtual tree number 1, with Node A 111 at its root, a typical node 115 desiring to send data out of the structure passes a message up to the root node 111 of the virtual tree where it is forwarded to an external connection. Data arriving on the external network connection may be forwarded to a particular node such as 115 by using a broadcast filter that filters out all other nodes as described in greater detail herein. Further details regarding the operation of the global tree network, particularly with respect to functionality supporting programmable point-to-point or sub-tree messaging used for input/output, program load, system management, parallel job monitoring and debug can be found in herein-incorporated, commonly-owned, co-pending U.S. patent application Ser. No. ______ [YOR982001-1002, YOR982001-1005 and YOR982001-1009].

[0031] Referring back to FIG. 2, in general, I/O traffic remains within the virtual trees that have an external connection at their roots. However, if an external connection fails, another node with an external connection may be used for fail-over. For example, if the external I/O connection to Node A 111 in FIG. 2 fails, then all of the nodes in the sub-tree whose root is Node A can communicate with the external filesystem or host system through the Node B 110.

[0032] It should be understood that the hardware functionality built into the tree 20 includes, but is not limited to, integer addition, integer maximum, minimum, bitwise logical AND, bitwise logical OR, bitwise logical XOR (exclusive OR) and broadcast. The functions are implemented in the lowest latency manner possible. For example, the addition function results in the lowest byte of the word being sent first on the global network. This low byte is immediately added to the other bytes (in hardware) from the other sources with the result being shifted out to the next level of the tree. In this way, an 8-byte word, for example, has already progressed up several layers of the tree before the high order byte is shifted out. This results in the possibility for a very low latency addition over the entire machine. As is explained in co-pending U.S. patent application Ser. No. (YOR920,020,031US1, YOR920,020,032US1 (15258)) entitled “Arithmetic Functions in Torus and Tree Networks”, other arithmetic functions such as minimum and subtraction can be accomplished by suitable preconditioning of the data. Floating point summation can also be accomplished by two passes on the tree, all at very low latency compared to methods that accomplish this result without a global combining tree. An arithmetic or logical operation on the tree always results in a flow up the tree, where all results are combined, and a subsequent flow from the root back down the tree, distributing the result to all branches. As will be described, certain branches can be omitted from the calculation in a controlled fashion.
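The preconditioning mentioned above is detailed in the referenced co-pending application, not in this document. As an illustration only, one well-known way to obtain a minimum from a maximum operation and a subtraction from an addition operation on fixed-width unsigned integers is sketched below. The 32-bit width, the helper names, and the use of Python's `max` and `sum` as stand-ins for the hardware reductions are all assumptions.

```python
WIDTH = 32
MASK = (1 << WIDTH) - 1


def tree_max(operands):
    """Stand-in for the hardware integer maximum reduction."""
    return max(operands)


def tree_add(operands):
    """Stand-in for the hardware integer addition reduction (wraps at WIDTH bits)."""
    return sum(operands) & MASK


def global_min(operands):
    """Minimum via maximum: complement every operand before injection and
    complement the result after reception."""
    complemented = [(~x) & MASK for x in operands]
    return (~tree_max(complemented)) & MASK


def global_subtract(minuend, subtrahends):
    """Subtraction via addition: inject the two's complement of each subtrahend."""
    negated = [(-x) & MASK for x in subtrahends]
    return tree_add([minuend] + negated)


if __name__ == "__main__":
    print(global_min([7, 3, 9]))
    print(global_subtract(10, [3, 4]))
```

In both cases the tree itself only ever performs a maximum or an addition; the extra work is done locally at each node before injection and after reception.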

[0033] In the preferred embodiment, the global tree network of the present invention comprises interconnected routers, one per node, that each move data as well as compute collective reductions. FIG. 3 illustrates the basic architecture of a router device 200 for the tree network of FIG. 2. As shown in FIG. 3, each router device 200 includes a number of ports, e.g. four, that may either be connected to another router, or disabled when not connected. As shown in FIG. 3, the router has four input ports 210 a-213 a and corresponding four output ports 210 b-213 b to form datapaths that permit a 3-tree to be constructed. In one embodiment, only one of the four ports may be designated as a connection to a parent node, and up to three of the remaining ports can be connected to child nodes. A leaf node at the bottom of the tree will have only the one port connected to its parent node enabled, while the root of the tree will have no parent enabled, but at least one child enabled. It is understood that the datapaths are created through a crossbar switch 215 as shown in FIG. 3.

[0034] For purposes of description, in the router device 200 of FIG. 3, data always flows from left to right. Thus, a packet may enter the router device 200 from either a local injection FIFO 202 or one of the router's input ports 210 a-213 a. If the packet enters a port, then it is placed into one of two input FIFOs (e.g., A or B) depending on which of the two virtual networks it belongs to. The packet is eventually consumed by either logic and arithmetic operations executed by ALU unit 240 provided in uptree select block 220 or the downtree select block 230. The result of the uptree logic or the downtree selection is broadcast to all four output stages 210 b-213 b, each of which may or may not handle it depending on the operation and output ports it is destined for. The select blocks 220, 230 include an arbiter circuit (not shown) that decides where a packet (or packets) is (are) to move through the router. It is understood that there may be simultaneous uptree and downtree traffic.

[0035] Software access to the tree is provided by the injection and reception interfaces 202, 204, and a set of configuration registers 218. In general, the configuration registers 218 are used to configure the router and determine its status, while the injection and reception interfaces 202, 204 are used by applications to provide operands and receive results respectively. More particularly, each virtual tree is configured by storing appropriate values into each router's virtual tree configuration registers 218 of which there is one per virtual tree. For each virtual tree, the configuration register permits a node to specify: 1) whether or not it is the root of the tree, 2) whether or not it is participating in the tree, and/or 3) whether or not it should force reception of uptree broadcast packets. In addition, the virtual tree configuration register 218 enables a node to specify which of its children either participate in the tree, or have participants below them. This is necessary for supporting sparse trees.

[0036] Applications interact with the tree through the CPU injection 202 and CPU reception 204 interfaces. Data is sent into the tree by being stored as a packet into the injection interface 202, either explicitly or through direct memory access (DMA). Similarly, results are removed from the tree by being read as a packet from the reception interface 204, either explicitly or through DMA.

[0037] Although not shown, it is understood that a flow control technique is implemented between routers using, for example, a token-based protocol that permits several packets' worth of slack. That is, every output port 210 b-213 b that is enabled is connected to a single input port of another router. Generally, each virtual channel of that input port grants the corresponding virtual channel of the output port a token for every packet's worth of buffer space in its input FIFO. The output port consumes tokens as it sends packets, and the input port returns tokens to the output port as it frees FIFO space. Therefore, the output port may continue to send packets as long as it has tokens available.
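The token exchange described above can be sketched for a single virtual channel as follows. This is a simplified software model, not the hardware protocol: the class name `TokenLink`, its methods, and the immediate return of a token on consumption are assumptions made for illustration.

```python
from collections import deque


class TokenLink:
    """Models one virtual channel between an output port and the single
    input port it is connected to."""

    def __init__(self, fifo_slots: int):
        self.input_fifo = deque()
        # The input port grants one token per packet's worth of FIFO space.
        self.tokens = fifo_slots

    def try_send(self, packet) -> bool:
        """Output port side: send only if a token is available."""
        if self.tokens == 0:
            return False          # no buffer space downstream; hold the packet
        self.tokens -= 1          # consume a token for the packet sent
        self.input_fifo.append(packet)
        return True

    def consume(self):
        """Input port side: free one FIFO slot and return its token."""
        packet = self.input_fifo.popleft()
        self.tokens += 1
        return packet


if __name__ == "__main__":
    link = TokenLink(fifo_slots=2)
    print([link.try_send(p) for p in ("p0", "p1", "p2")])  # third send is refused
    link.consume()
    print(link.try_send("p2"))                              # succeeds after a token returns
```

Because a packet is only ever sent against a token, the input FIFO in this model can never overflow.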

[0038] The Arithmetic and Logic Unit (ALU) block 240 within the router device of the preferred embodiment is enabled to perform five reduction operations on four operand sizes. The operations are integer addition, integer maximum, bitwise logical OR, bitwise logical XOR, and bitwise logical AND. The operand sizes are 32 bits, 64 bits, 128 bits, and 2048 bits. It should be understood that the architecture depicted in FIG. 3 does not preclude a different choice for operations or operand sizes. Particularly, software is employed for selecting the operation and operand size.

[0039] Typically, those nodes which participate in reduction operations inject “reduction”-type packets by storing them in the CPU injection FIFO 202. Reductions are performed at the granularity of packets, where a packet, according to one embodiment, carries a payload of 256 bytes, for example. An individual packet will always carry operands of the same size, and perform the same reduction on all of the operands. Any node can be configured not to participate in reductions for each virtual tree. In this case, the node will not supply any data to reductions and will not receive results.

[0040] For each virtual tree, the router device 200 is configured to specify which of its children will be participating in reductions. When it receives a reduction packet from each of its participating children and the local injection FIFO (unless the local node is not participating), it computes the specified reduction operation on the contents of the packets and sends the results as a single packet to its parent. That is, the first word of each packet is combined to produce the first word of the result packet. The second word of each packet is combined to produce the second word of the result packet, and so forth. In this manner, the global result is recursively computed up the tree, finally completing at the root node of the reduction tree as a single packet containing the results.
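The word-by-word combining performed at each router, and the recursion up the tree, can be sketched in software as follows. This is a minimal model in which packets are lists of integer words; the function names, the dictionary-based tree and the convention that a non-participating node is simply absent from `injected` are illustrative assumptions.

```python
import operator
from functools import reduce


def combine_packets(packets, op):
    """Combine equal-length packets word by word: the i-th words of all
    input packets produce the i-th word of the result packet."""
    return [reduce(op, words) for words in zip(*packets)]


def reduce_uptree(node, children, injected, op):
    """Return the single packet that `node` sends to its parent, or None
    if neither the node nor anything below it participates."""
    inputs = []
    for child in children.get(node, []):
        packet = reduce_uptree(child, children, injected, op)
        if packet is not None:            # only participating children contribute
            inputs.append(packet)
    if node in injected:                  # local injection, if participating
        inputs.append(injected[node])
    if not inputs:
        return None
    return combine_packets(inputs, op)


if __name__ == "__main__":
    tree = {"root": ["a", "b"], "a": ["c"]}
    operands = {"root": [1, 2], "a": [10, 20], "b": [100, 200], "c": [1000, 2000]}
    print(reduce_uptree("root", tree, operands, operator.add))
    print(reduce_uptree("root", tree, operands, max))
```

The packet returned for the root corresponds to the single result packet that completes the reduction.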

[0041] Preferably, any node can be configured as the root of a virtual reduction tree. Once the reduction reaches that node, the single, combined packet is either received, broadcast to all of the participating children, or both. When a router receives a reduction packet destined for a child node downtree, it forwards copies of the packet to each of its children. It also places a copy of the packet in its local reception FIFO 204 if it is configured to participate in reductions on that virtual tree.

[0042] In a preferred embodiment, the width of the physical interconnect is narrower than the operand width, so operands are transmitted on the tree in a serialized manner. In order to achieve the lowest possible latency, integer operands are transmitted with the lowest order bits first so that results can be calculated and even forwarded as operands arrive. In this way, a result has potentially progressed up several levels of the tree before its high order bits are shifted out, resulting in very low latency over all the nodes. It should be understood that the pipelined maximum operation is computed beginning with the word containing the highest order bits because numbers are found to be different based on the highest order bit in which they differ. The hardware automatically reverses injected and received maximum operands so that the computation is performed from high order to low order bits.
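The reason addition can proceed low-order first while maximum must proceed high-order first can be seen in a small software model. The sketch below assumes unsigned operands and a one-byte serialization unit, which is an assumption for illustration; the function names are hypothetical.

```python
def serial_add(operands, num_bytes):
    """Add unsigned integers one byte position at a time, lowest byte first.
    Each output byte depends only on bytes already seen plus a carry, so it
    could be forwarded uptree before the higher bytes arrive."""
    carry = 0
    out_bytes = []
    for i in range(num_bytes):
        total = carry + sum((x >> (8 * i)) & 0xFF for x in operands)
        out_bytes.append(total & 0xFF)   # result byte for this position
        carry = total >> 8               # carried into the next position
    return int.from_bytes(bytes(out_bytes), "little")  # wraps at num_bytes


def serial_max(operands, num_bytes):
    """Find the maximum one byte position at a time, highest byte first.
    An operand is eliminated at the first position where it falls behind."""
    candidates = list(operands)
    out_bytes = []
    for i in reversed(range(num_bytes)):
        digit = max((x >> (8 * i)) & 0xFF for x in candidates)
        candidates = [x for x in candidates if (x >> (8 * i)) & 0xFF == digit]
        out_bytes.append(digit)
    return int.from_bytes(bytes(out_bytes), "big")


if __name__ == "__main__":
    print(hex(serial_add([0x01FF, 0x0001, 0x00FF], 8)))
    print(hex(serial_max([0x1234, 0x12FF, 0x0FFF], 2)))
```

In the addition, information (the carry) flows only from low to high positions; in the maximum, the decision is settled at the highest position where the operands differ, so each scans in the direction that lets output be emitted immediately.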

[0043] The integer reductions may additionally be used to compute floating point reductions. For example, a global floating point sum may be performed by utilizing the tree two times, wherein the first time, the maximum of all the exponents is obtained, and in the second time, all the shifted mantissas are added.
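The two-pass floating point sum can be sketched as follows, with Python's `max` and `sum` standing in for the integer maximum and integer addition reductions on the tree. The 52-bit fixed-point scale, the truncation of shifted mantissas and the function name are assumptions made for this illustration.

```python
import math

MANTISSA_BITS = 52   # assumed fixed-point precision kept after shifting


def two_pass_float_sum(values):
    # Pass 1: integer maximum reduction over the binary exponents.
    max_exp = max(math.frexp(v)[1] for v in values)

    # Locally, each node shifts its value to the common exponent, which
    # yields an integer mantissa (low-order bits of small values are lost).
    shifted = [int(math.ldexp(v, MANTISSA_BITS - max_exp)) for v in values]

    # Pass 2: integer addition reduction over the shifted mantissas.
    total = sum(shifted)

    # Locally, each node converts the integer result back to floating point.
    return math.ldexp(total, max_exp - MANTISSA_BITS)


if __name__ == "__main__":
    print(two_pass_float_sum([1.5, 2.25, -0.75, 1024.0]))
```

Because the second pass is an integer addition, the result does not depend on the order in which operands are combined in the tree.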

[0044] As mentioned, the tree network 100 of the invention is an ideal structure for performing efficient global broadcasts. A hardware broadcast operation is always performed from the root of the tree, but any node may broadcast by first sending a point-to-point, “broadcast”-type message to the router device at the root node, which then starts the broadcast automatically. For the most part, global broadcasts respect the rules and restrictions of reductions, but differ in their uptree behavior. Any node may perform a broadcast of a payload by injecting a packet of the broadcast type on a virtual tree. The packet travels unaltered up the tree until it reaches a node configured as the root of the virtual tree. There it is turned around and broadcast to all of the participating children on that virtual tree. Therefore, it will only be received by those nodes participating in reductions on that virtual tree.

[0045] Reception of broadcasts, according to the invention, is further controlled by filtering information included within the packet. The filtering mechanism of the preferred embodiment functions by matching a value included in the packet to a preconfigured value stored in each router, and only receiving the packet if the values match. In general, every node in the system is assigned a unique value (address), so this broadcast filtering mechanism allows a message to be sent from the root node to a single node below it. It is also possible to use non-unique addresses to cause reception by a subset of the nodes. There are many ways in which broadcast filtering could be generalized. For example, use of a bit vector instead of an address would allow multiple, disjoint, configurable subsets of nodes to receive broadcasts.
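The address-matching filter, and the bit-vector generalization suggested above, can be sketched as follows. The node names, address values and function names are hypothetical, and the bit-vector rule shown (receive when the packet's vector and the node's configured vector share a set bit) is one possible reading of that generalization, not a described implementation.

```python
def receivers_by_address(packet_address, configured_addresses):
    """Nodes that receive a filtered broadcast: those whose preconfigured
    value equals the value carried in the packet."""
    return sorted(node for node, address in configured_addresses.items()
                  if address == packet_address)


def receivers_by_bit_vector(packet_vector, configured_vectors):
    """Assumed generalization: each bit names a group, and a node receives
    the packet if it belongs to at least one group selected by the packet."""
    return sorted(node for node, vector in configured_vectors.items()
                  if vector & packet_vector)


if __name__ == "__main__":
    addresses = {"n115": 115, "n116": 116, "n117": 116}
    print(receivers_by_address(115, addresses))   # unique address: one receiver
    print(receivers_by_address(116, addresses))   # shared address: a subset

    groups = {"a": 0b01, "b": 0b10, "c": 0b11}
    print(receivers_by_bit_vector(0b10, groups))
```

Every node on the virtual tree still sees the broadcast pass through its router; the filter only decides whether a copy is placed in the local reception interface.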

[0046] Efficient sharing of external I/O connections is provided by a combination of broadcast filtering and a “root” packet type. The root-type packet always travels up a virtual tree until it encounters a node designated as a root of that tree, where it is unconditionally received. This allows non-root nodes to send messages to the root, where they can be forwarded to the external connection. Data arriving on the external connection may be forwarded to a particular non-root node using a filtering broadcast with an address that matches the intended destination.

[0047] If an external connection fails, the nodes using that connection may fail over to the next node up the tree with an external connection. For traffic from the nodes, this is performed by simply reconfiguring the node at the failed external connection so that it is no longer the root of the virtual tree, and reconfiguring the failover node as the new root. Traffic to the nodes is more complicated because a broadcast from the failover root will go to all the children of that node, not just the children below the failed node. For example, if node A 111 fails over to node B 110 in FIG. 2, then packets from node B will be broadcast to the entire tree.

[0048] In order to prevent unnecessary traffic, any router device may be configured to block downtree traffic on each virtual tree independently. Packets entering the router on the uptree link for a virtual tree that is configured to block are simply dropped. For example, suppose that the nodes below node A 111 in FIG. 2 are using virtual tree labeled tree 1 to send and receive external I/O using the connection at node A 111. To fail the connection at node A over to node B, node B is configured to be the root of virtual tree 1 instead of node A, and nodes C and D are configured to block downtree traffic on virtual tree 1. It should be readily understood that this downtree blocking mechanism may be used in general to prune virtual trees.

[0049] Any packet may be injected into the tree network with an interrupt request attached. The eventual effect of this is to cause a maskable interrupt at every node that receives the packet or, in the case of reductions, a result computed from the packet. A reduction result will cause interrupts if any of the injected packets contributing to that result requested an interrupt. Furthermore, a global reduction operation can be used to perform a software barrier with the interrupt mechanism. Briefly, each node enters the barrier by clearing its interrupt flag and then contributing to the global reduction. It detects the completion of the barrier by polling on the interrupt flag or receiving an interrupt. Further details regarding the operation of the global combining tree and barrier network may be found in herein-incorporated, commonly-owned, co-pending U.S. patent application Ser. No. ______ [YOR 8-2001-1009].
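
A minimal software model of the barrier just described may be sketched as follows. It is an illustration under stated assumptions, not the hardware mechanism: the reduction is assumed to be an integer sum, the whole tree is collapsed into a single object, and the names `combine`, `SoftwareBarrier`, `enter` and `passed` are hypothetical.

```python
def combine(packets):
    """Reduce (value, irq) operands.

    The values are summed (an assumed reduction operation) and the interrupt
    requests are OR-ed, so a single request marks the combined result.
    """
    total = sum(value for value, _ in packets)
    irq = any(flag for _, flag in packets)
    return total, irq


class SoftwareBarrier:
    def __init__(self, nodes):
        self.interrupt_flag = {n: False for n in nodes}
        self.operands = {}

    def enter(self, node):
        """Clear the node's interrupt flag, then contribute to the global reduction."""
        self.interrupt_flag[node] = False
        self.operands[node] = (1, True)  # operand injected with an interrupt request
        if len(self.operands) == len(self.interrupt_flag):
            # All operands are present: the result is broadcast and, because an
            # interrupt was requested, every node's flag is raised.
            _, irq = combine(list(self.operands.values()))
            if irq:
                for n in self.interrupt_flag:
                    self.interrupt_flag[n] = True
            self.operands.clear()

    def passed(self, node):
        """Poll the interrupt flag to detect completion of the barrier."""
        return self.interrupt_flag[node]
```

A node that has entered the barrier sees `passed` return false until the last participant has contributed, at which point every flag is set together.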

[0050] The tree network of the invention guarantees the correct completion of operations as long as they follow basic ordering rules. That is, because packets are processed by the routers 200 in the order in which they are received, deadlock of a virtual network results if the nodes participating in operations on a virtual tree do not inject reduction operands in the same order, or fail to inject an operand. Similarly, deadlock may occur if two virtual trees overlap on the same virtual network, and operand injection violates the strict ordering rule of the virtual network. Preferably, there are no ordering restrictions on broadcast or point-to-point messaging operations, and these operations may be interleaved with reductions.

[0051] Guaranteed completion of correctly ordered operations is provided by a hardware error recovery mechanism. Briefly, each router retains a copy of every packet that it sends across a global tree network link until it receives an acknowledgment that the packet was received with no error. A link-level communication protocol such as a sliding window protocol with packet CRC may be implemented that includes a mechanism for detection of corrupted packets, and a mechanism to cause those packets to be retransmitted using the saved copy.
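
The retain-until-acknowledged behavior may be sketched as follows. This is a simplified software illustration, not the hardware protocol: the frame layout, the use of CRC-32 from Python's standard `zlib` module, and the names `LinkSender`, `make_frame` and `frame_is_valid` are all assumptions, and window size limits are omitted.

```python
import zlib


def make_frame(seq: int, payload: bytes):
    """Build a (sequence number, payload, CRC) frame; CRC-32 is an assumed choice."""
    return (seq, payload, zlib.crc32(payload))


def frame_is_valid(frame) -> bool:
    """Receiver-side check: recompute the CRC and compare it with the one carried."""
    _, payload, crc = frame
    return zlib.crc32(payload) == crc


class LinkSender:
    def __init__(self):
        self.next_seq = 0
        self.unacked = {}  # saved copies, keyed by sequence number

    def send(self, payload: bytes):
        """Transmit a packet and retain a copy until it is acknowledged."""
        frame = make_frame(self.next_seq, payload)
        self.unacked[self.next_seq] = frame
        self.next_seq += 1
        return frame

    def on_ack(self, seq: int):
        """The packet arrived without error; the saved copy may be released."""
        self.unacked.pop(seq, None)

    def retransmit(self, seq: int):
        """Resend a corrupted packet from the saved copy."""
        return self.unacked[seq]
```

If a frame is corrupted in transit, `frame_is_valid` fails at the receiver and the sender resends from `unacked`; the copy is released only once `on_ack` is called.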

[0052] As mentioned, flow control is maintained through the use of a token-based communication protocol. An “upstream” router sending packets to a “downstream” router has some number of tokens which represent the amount of free storage capacity in the downstream router. Whenever the upstream router sends a packet, it consumes a token, and it cannot send the packet unless it has a token left. Conversely, the downstream router issues tokens to the upstream router whenever it frees storage space. The balance between storage space and packet latency ensures that the link is kept constantly busy.
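
The token accounting may be sketched as follows, assuming for simplicity that one token corresponds to storage for one packet. The class name `TokenLink` and its methods are hypothetical.

```python
class TokenLink:
    """Upstream router's view of one link to a downstream router."""

    def __init__(self, downstream_slots: int):
        # Assumption: one token represents free storage for one packet downstream.
        self.tokens = downstream_slots

    def try_send(self) -> bool:
        """Consume a token and send, or refuse if no token is left."""
        if self.tokens == 0:
            return False
        self.tokens -= 1
        return True

    def on_tokens_returned(self, count: int = 1) -> None:
        """The downstream router freed storage space and issued tokens."""
        self.tokens += count
```

With two downstream slots, two packets can be sent back to back; a third must wait until the downstream router returns a token.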

[0053] In a downtree broadcast where a single packet is typically sent over multiple downtree links, as well as received locally, flow control may be implemented to prevent a packet from advancing until tokens are available on all of the downtree links and there is room in the CPU receive FIFO 204. However, this conservative approach may affect throughput for filtering broadcasts intended for a single destination, because that destination could be below a link that has tokens, while the packet waits on another link that does not. Thus, in the preferred embodiment, the tree network performs an “aggressive” broadcast, which essentially decouples flow control on the individual downtree links. Referring to FIG. 3, a packet is forwarded to the Out FIFOs 250 of the appropriate downtree links and virtual network as soon as there is sufficient storage space available in all of them. Each Out FIFO 250 is then individually drained to its output port 210b-213b as tokens become available. Note that the individual copies of the packet must be placed in each Out FIFO 250 anyway for the purpose of transmission error recovery through retransmission, described earlier.
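
The difference between waiting for tokens on every link and the aggressive broadcast may be illustrated with the following sketch. It models only the Out FIFO occupancy and per-link tokens; retention of copies for retransmission and the CPU receive FIFO are omitted, and the names `OutFifo`, `drain` and `aggressive_broadcast` are hypothetical.

```python
from collections import deque


class OutFifo:
    """One downtree link: a bounded output queue plus that link's token count."""

    def __init__(self, capacity: int, tokens: int):
        self.capacity = capacity
        self.tokens = tokens
        self.queue = deque()

    def has_space(self) -> bool:
        return len(self.queue) < self.capacity

    def drain(self) -> list:
        """Send queued packets on this link while tokens are available."""
        sent = []
        while self.queue and self.tokens > 0:
            self.tokens -= 1
            sent.append(self.queue.popleft())
        return sent


def aggressive_broadcast(packet, fifos: list) -> bool:
    """Copy the packet into every target Out FIFO once all of them have space.

    Tokens are not consulted here; each FIFO drains on its own afterwards.
    """
    if not all(f.has_space() for f in fifos):
        return False
    for f in fifos:
        f.queue.append(packet)
    return True
```

In this model a link that has tokens sends its copy immediately, while a link without tokens simply holds its copy until a token arrives, so one stalled link does not delay delivery on the others.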

[0054] In the preferred embodiment, as described in greater detail in commonly-owned, co-pending U.S. patent application Ser. No. (YOR9-20,010,211US2 (15275)) entitled “A Novel Massively Parallel Supercomputer”, and described herein with respect to FIGS. 1-3, each processing node 12 is based on a system-on-a-chip process, i.e., all functions of the computer node, including the routing functions, are integrated into a single ASIC, resulting in a dramatic reduction in node size and power. This supercomputer architecture is further leveraged to increase node density, thereby decreasing the overall cost/performance for the machine. Each node preferably incorporates many such functions into the computer ASIC including, but not limited to: a PowerPC 440 embedded processing core, a Floating Point core, embedded DRAM, integrated external DDR memory controller, message processor, Ethernet adapter, as well as the network routers. In one embodiment, the same compute ASIC node may be used as an I/O node which is associated with a subset of the compute nodes, e.g., 64 nodes, for handling fileserver communication and I/O operations. That is, the I/O nodes are very similar to the compute nodes; they may differ only in external memory configuration and in the addition of an external network interface, such as Gigabit Ethernet, for example. It should be understood that the tree network router described herein can function as a stand-alone device in addition to the integrated device of the preferred embodiment.

[0055] While the invention has been particularly shown and described with respect to illustrative and preferred embodiments thereof, it will be understood by those skilled in the art that the foregoing and other changes in form and details may be made therein without departing from the spirit and scope of the invention, which should be limited only by the scope of the appended claims.

Having thus described our invention, what we claim as new, and desire to secure by Letters Patent is:
 1. Apparatus for performing collective reductions, broadcasts, and point-to-point message passing during parallel algorithm operations executing in a computing structure comprising a plurality of processing nodes, said apparatus comprising: a global tree network including routing devices interconnecting said nodes in a tree configuration, said tree configuration including one or more virtual tree networks thereof, said global tree network enabling global processing operations including one or more of: global broadcast operations downstream from a root node to leaf nodes of specified virtual tree networks, global reduction operations upstream from leaf nodes to root node in said virtual tree, and point-to-point message passing from any node of said virtual tree to the root node of said virtual tree as required, wherein said global tree network and routing device configuration are optimized for providing low-latency communications in said computing structure.
 2. The apparatus as claimed in claim 1, wherein said computing structure includes a plurality of processing nodes interconnected to form a first network, said one or more virtual tree networks and said first network are collaboratively or independently utilized according to bandwidth and latency requirements of a parallel algorithm for optimizing parallel algorithm processing performance.
 3. The apparatus as claimed in claim 1, wherein a root node of a virtual tree network functions as an I/O node including a high speed connection to an external system, said I/O node performing I/O operations for that virtual tree network independent of processing performed in said first network.
 4. The apparatus as claimed in claim 3, wherein each said router includes input devices for receiving packets from other nodes of a virtual tree, output devices for forwarding packets to other nodes of said tree, a local injection device for injecting packets into said tree, and a local reception device for removing packets from said tree, said apparatus further including means for configuring said router to either participate or not participate in said virtual tree.
 5. The apparatus as claimed in claim 4, wherein said means for configuring said router further specifies participation of said node as a root of a virtual tree for reduction operations.
 6. The apparatus as claimed in claim 5, wherein said means for configuring said router further specifies participation of input devices and the local injection device for providing operands during reduction operations.
 7. The apparatus as claimed in claim 6, wherein said router further includes means for computing a specified reduction operation on packet contents received by contributing input devices and the local injection device if it is contributing, and means for causing transmission of computation results to that node's upstream parent node via said output device.
 8. The apparatus as claimed in claim 7, wherein said virtual tree network is programmed for recursively causing a global combined result to be computed up said virtual tree for completion as a single packet at said root node.
 9. The apparatus as claimed in claim 8, further including means for broadcasting a single, combined packet at said root to all of the participating children configured to contribute operands to reductions on that virtual tree.
 10. The apparatus as claimed in claim 3, further including a mechanism for enabling compute nodes to send point-to-point packets to an I/O node at the root of a virtual tree that are destined for an external system via said high-speed connection.
 11. The apparatus as claimed in claim 9, further comprising a filter mechanism for controlling reception of broadcast packets at nodes of a virtual tree, said reception being based upon a node address and participation in said virtual tree.
 12. The apparatus as claimed in claim 11, wherein each node includes an address, said system further comprising programmable means enabling point-to-point messaging among nodes of each said virtual tree, said address enabling an external host system to directly communicate to every node or a subset of the nodes.
 13. The apparatus as claimed in claim 9, further including a mechanism for generating a hardware interrupt to a processor of a processing node based on the contents of a packet received by the local reception device.
 14. The apparatus as claimed in claim 9, further including a mechanism for blocking unnecessary downtree traffic on each virtual tree independently.
 15. The apparatus as claimed in claim 11, further comprising a mechanism for providing flow control between routers when communicating packets.
 16. The apparatus as claimed in claim 15, further comprising means enabling broadcasting of packets on individual downstream links decoupled from said flow control mechanism to perform aggressive broadcasting.
 17. The apparatus as claimed in claim 2, wherein said first network includes an n-dimensional torus, where n is greater than or equal to one.
 18. A method for performing collective reductions, broadcasts, and message passing during parallel algorithm operations executing in a computer structure having a plurality of interconnected processing nodes, said method comprising: providing router devices for interconnecting said nodes via links according to a global tree network structure, said tree structure including one or more virtual sub-tree structures; and, enabling low-latency global processing operations to be performed at nodes of said virtual tree structures, said global operations including one or more of: global broadcast operations downstream from a root node to leaf nodes of specified virtual sub-tree networks, global reduction operations upstream from leaf nodes to the root node in said tree, and point-to-point message passing from any node of said virtual tree to the root node of said virtual tree as required when performing said parallel algorithm operations.
 19. The method as claimed in claim 18, wherein said computing structure includes a plurality of processing nodes interconnected to form a first network, said method further including the step of collaboratively or independently utilizing said global tree network and first network in accordance with bandwidth and latency requirements of a parallel algorithm for optimizing parallel algorithm processing performance.
 20. The method as claimed in claim 18, wherein a root node of each virtual tree network functions as an I/O node including a high-speed connection to an external system, said method including the step of performing node I/O operations for that virtual tree network independent of operations performed in said first network.
 21. The method as claimed in claim 20, wherein each said router includes input devices for receiving packets from other nodes of a virtual tree, output devices for forwarding packets to other nodes of said tree, a local injection device for injecting packets into said tree, and a local reception device for removing packets from said tree, said method further including the step of configuring said router to either participate or not participate in a virtual tree.
 22. The method as claimed in claim 20, wherein said router configuring step further comprises the step of specifying participation of a node as a root of a virtual tree when performing reduction operations.
 23. The method as claimed in claim 22, wherein said router configuring step further comprises one or more steps of: specifying participation of said processing node coupled to said router for injecting operands during reduction operations; and, specifying participation of said processing node coupled to said router for injecting operands during reduction operations.
 24. The method as claimed in claim 23, further comprising the steps of: configuring said router to compute a specified reduction operation on packet contents received from contributing children nodes and said processing node, and causing transmission of computation results to that node's upstream parent node via an output device.
 25. The method as claimed in claim 24, further including the step of recursively causing a global combined result to be computed up said virtual tree for completion as a single packet at said root node.
 26. The method as claimed in claim 25, further including the step of broadcasting a single, combined packet at said root to all of the participating children configured to contribute operands to reductions on that virtual tree.
 27. The method as claimed in claim 20, further including the step of enabling compute nodes to send point-to-point packets to an I/O node at the root of a virtual tree that are destined for an external system via said high-speed connection.
 28. The method as claimed in claim 26, further comprising the step of controlling reception of broadcast packets at nodes of a virtual tree, said reception being based upon said address of said node and its participation in said virtual tree.
 29. The method as claimed in claim 28, wherein each node includes an address, said method further comprising the step of enabling point-to-point and sub-tree messaging among nodes of each said virtual tree, said address enabling a host system to directly communicate to every node or a subset of the nodes.
 30. The method as claimed in claim 26, further including the step of: generating a hardware interrupt to a processor of a processing node based on the contents of a packet received by the local reception device.
 31. The method as claimed in claim 26, further including the step of: independently blocking unnecessary downtree traffic on each virtual tree.
 32. The method as claimed in claim 28, further comprising the step of providing flow control between routers when communicating packets.
 33. The method as claimed in claim 32, further comprising the step of enabling aggressive broadcasting of packets on individual downstream links by decoupling said flow control mechanism.